Exploring Melville's Moby Dick using a Latent Dirichlet Allocation¶
This content is part of the lecture Data & Web Technologies for Data Analysis, which I taught at UC Davis from 2022 to 2025.
This notebook demonstrates the workings of a Latent Dirichlet Allocation (LDA), used to identify unobserved topics in bodies of text. As an example, we use Herman Melville's 19th-century novel Moby Dick.
On the surface, Moby-Dick tells the story of Ishmael, a sailor aboard the whaling ship Pequod, and Captain Ahab’s obsessive hunt for the elusive white whale. But Melville didn't just write a seafaring tale—he filled the book with chapters on cetology, the study of whales. Back in the 1850s, most readers had never seen a whale, let alone understood the brutal business of whaling. Melville used these detours to explain the basics: types of whales, methods of hunting, and the sheer vastness of the sea. These sections are packed with philosophical musings and, occasionally, some dubious science. Modern readers often find them dense, and many guides suggest skipping them entirely since they don't move the plot forward.
For our purposes, though, this mix of storytelling and cetology is perfect. It’s reasonable to expect that different themes dominate the whaling lectures versus the narrative chapters. Using LDA, we'll try to tease apart these underlying topics and see if the model can distinguish between the story and the cetology.
Below, I will first introduce LDA and describe how it works. If you're not interested in the mathematical details and derivations, feel free to skip them—they help us understand how the model is fitted, but they’re not strictly necessary to follow what’s going on. I have marked these sections with an asterisk.
Latent Dirichlet Allocation¶
Latent Dirichlet Allocation (LDA) is a probabilistic model for collections of discrete data, first proposed by Blei et al. (2003). It was originally developed for text analysis. In this setting, the discrete data consists of words, which appear in documents, and those documents make up a corpus. Each document is assumed to contain a mixture of topics, which are latent—that is, they can’t be directly observed and must be inferred.
Here’s a quick overview of the basic terms:
- A word is the smallest unit (token or term) in the data, taken from a vocabulary of $D$ words. The $i$-th word in the dictionary can be represented as a one-hot vector: all entries are zero except at position $i$.
- A document is a sequence of $N$ words, denoted as $W = (w_1, \dots, w_N)$.
- A corpus is the full collection of $M$ documents: $\{W_1, \dots, W_M\}$.
Each document is a mixture of topics, and given a document’s topic distribution, LDA assumes that the words are generated based on those topics. This process is often called the data-generating mechanism. Formally, we assume the following process:
- Choose $\theta\sim Dir(\alpha)$, where $\theta, \alpha\in\mathbb{R}^k$.
- For each word $w_i$ in the document $W$, $i = 1, \dots, N$:
- Choose a topic $z_i\sim Mult(1,\theta)$,
- Choose a word from $P(w_i|z_i,\beta)$.
Here, $P(w_i|z_i,\beta)$ is again a multinomial probability with $n=1$, conditional on the topic $z_i$. We assume that the number of topics $k$ is known and fixed, not random. The parameter $\beta\in\mathbb{R}^{k\times D}$, and $\beta_{ij}$ is the probability that word $j$ is chosen for topic $i$.
Why does the LDA work with the Dirichlet and Multinomial distributions? Because they form a conjugate hierarchy! Let's revisit their properties before we move on. Again, feel free to skip the section below if math is not your thing.
* Recap: Some Helpful Distributions¶
An LDA is modelled via a Bayesian hierarchy. It relies on the Dirichlet-Multinomial conjugacy.
Multinomial
Let $n,k\in\{0,1,2,...\}$ and $p = (p_1, \dots, p_k)'\in[0,1]^k$ with $\sum_{i=1}^kp_i=1$. $X\sim Mult(n,p)$ if $$ P(X_1=x_1, \dots, X_k = x_k) = \begin{cases} \frac{n!}{x_1!\cdots x_k!}p_1^{x_1}\cdots p_k^{x_k}& \text{if }\sum_{i=1}^kx_i = n,\\ 0, &\text{else.} \end{cases} $$ The Multinomial distribution is a generalization of the Binomial distribution to more than two outcomes.
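The pmf above is easy to evaluate directly. The sketch below (the helper name `multinomial_pmf` is my own) implements the formula and sanity-checks it against the Binomial special case $k=2$:

```python
from math import factorial

def multinomial_pmf(x, p):
    """pmf of Mult(n, p) evaluated at counts x, with n = sum(x)."""
    n = sum(x)
    coef = factorial(n)
    for xi in x:
        coef //= factorial(xi)          # multinomial coefficient n!/(x_1!...x_k!)
    prob = float(coef)
    for xi, pi in zip(x, p):
        prob *= pi**xi                  # multiply in p_i^{x_i}
    return prob

# Binomial special case (k = 2):
# P(X1 = 2, X2 = 1) = 3!/(2! 1!) * 0.3^2 * 0.7^1 = 0.189
print(multinomial_pmf([2, 1], [0.3, 0.7]))  # 0.189 (up to float error)
```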
Dirichlet
A Dirichlet distribution of order $k\geq2$ and parameters $\alpha_1, \dots, \alpha_k>0$ has the p.d.f. $$ f(p_1,\dots, p_k) = \frac{\Gamma(\sum_{i=1}^k\alpha_i)}{\prod_{i=1}^k\Gamma(\alpha_i)}\prod_{i=1}^kp_i^{\alpha_i-1} $$ for all $p_1, \dots, p_k\geq0$ with $\sum_{i=1}^kp_i=1$. This implies that the $p_i$ lie on a $k-1$-dimensional simplex. For $k=2$, the Dirichlet distribution coincides with the Beta distribution.
An important property of the Dirichlet distribution is that it's a conjugate prior to the Multinomial distribution. Consider the following hierarchy: \begin{align} X|p&\sim Mult(n,p)\\ p&\sim Dir(\alpha) \end{align} The parameter $p$ is not assumed to be fixed (as in a frequentist understanding) but random. Consequently, it cannot be estimated. The Bayesian methodology aims at computing the posterior distribution of $p$, given the observed data $X$: \begin{align} P(p|X) = \frac{P(X|p)P(p)}{P(X)} \end{align} One can show that the posterior distribution is again a Dirichlet distribution! Of course, the posterior parameters will not be the same as the prior parameters. They will depend on $X$, which is precisely the idea: you update your prior belief once you observe some data.
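We can verify this conjugacy numerically. For $k=2$ the Dirichlet reduces to a Beta and the Multinomial to a Binomial, so the claim becomes: prior $Beta(a_1,a_2)$ plus $x$ successes in $n$ trials yields posterior $Beta(a_1+x,\, a_2+n-x)$. A sketch with made-up prior parameters and data:

```python
import numpy as np
from math import comb, lgamma

a1, a2 = 2.0, 3.0          # prior parameters (chosen for illustration)
n, x = 10, 7               # observed data: x "successes" out of n trials

def beta_pdf(p, a, b):
    logc = lgamma(a + b) - lgamma(a) - lgamma(b)
    return np.exp(logc) * p**(a - 1) * (1 - p)**(b - 1)

grid = np.linspace(1e-6, 1 - 1e-6, 100_000)
dx = grid[1] - grid[0]
prior = beta_pdf(grid, a1, a2)
likelihood = comb(n, x) * grid**x * (1 - grid)**(n - x)
unnorm = prior * likelihood
posterior_numeric = unnorm / (unnorm.sum() * dx)   # normalize on the grid

# Conjugacy predicts the posterior is Beta(a1 + x, a2 + n - x):
posterior_analytic = beta_pdf(grid, a1 + x, a2 + n - x)
print(np.max(np.abs(posterior_numeric - posterior_analytic)))  # close to 0
```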
An Introductory Example¶
Consider as corpus a magazine about housekeeping. This magazine consists of three articles (in our terminology, these correspond to documents). The articles cover the topics home, garden and cooking, using the words pan, plot, window and way.
Let's write a document! First:
- Choose $\theta\sim Dir(\alpha)$, where $\theta, \alpha\in\mathbb{R}^k$.
import numpy as np
topics = np.array(["home", "garden", "cooking"])
dictionary = np.array(["pan", "plot", "window", "way"])
Given the hyperparameter vector $\alpha$, the draw $\theta$ determines the probability of each topic in the document.
alpha = [1, 2, 3] # Hyper-prior of the Dirichlet distribution. In expectation, the "cooking" topic is as likely as the other two combined.
theta = np.random.dirichlet(alpha)
print(theta) # Our document will have a mixture of these topics
[0.3111458 0.27935267 0.40950154]
Given the topic distribution vector, we can now select a topic for a word:
- Choose a topic $z_i\sim Mult(1,\theta)$
zi = np.random.multinomial(1, theta) # topic for word
print(zi)
[0 1 0]
In order to draw the word given the selected topic, we need to specify the parameter $\beta$.
beta = np.array([[0.1, 0.02, 0.88, 0], # Topic home: word "window" is the most likely
[0.01, 0.79, 0.1, 0.1], # Topic garden: word "plot" is the most likely
[0.75, 0.15, 0.1, 0]]) # Topic cooking: word "pan" is the most likely
print(beta)
[[0.1  0.02 0.88 0.  ]
 [0.01 0.79 0.1  0.1 ]
 [0.75 0.15 0.1  0.  ]]
Finally,
- Choose a word from $P(w_i|z_i,\beta)$.
print(beta[zi==1,][0]) # Selected beta
[0.01 0.79 0.1 0.1 ]
w = np.random.multinomial(1, beta[zi==1,][0]) # Draw from multinomial
print(dictionary[(np.where(w)[0][0])]) # Look up the word in dictionary
way
This process can now be repeated for all words in the document.
zi = np.random.multinomial(1, theta) # topic for word
wi = np.random.multinomial(1, beta[zi==1,][0])
print({'topic': str(topics[zi==1][0])})
print({'word': str(dictionary[wi==1][0])})
{'topic': 'cooking'}
{'word': 'plot'}
The complete corresponding Bayesian hierarchy is given as \begin{align} w_i|z_i,\beta_i&\sim Mult(1,\beta_i),\\ z_i|\theta &\sim Mult(1,\theta),\\ \theta&\sim Dir(\alpha). \end{align}
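Putting the three stages together, the hierarchy can be wrapped into a single function. This is a sketch reusing the toy dictionary and $\beta$ from above; `generate_document` is my own helper name, and I use numpy's `Generator` API instead of the legacy `np.random` calls:

```python
import numpy as np

dictionary = np.array(["pan", "plot", "window", "way"])
beta = np.array([[0.1, 0.02, 0.88, 0.0],
                 [0.01, 0.79, 0.1, 0.1],
                 [0.75, 0.15, 0.1, 0.0]])

def generate_document(alpha, beta, n_words, rng):
    """Draw one document following the LDA generative process."""
    theta = rng.dirichlet(alpha)                  # document's topic mixture
    words = []
    for _ in range(n_words):
        z = rng.choice(len(alpha), p=theta)       # draw a topic for this word
        w = rng.choice(beta.shape[1], p=beta[z])  # draw a word given the topic
        words.append(dictionary[w])
    return words

rng = np.random.default_rng(0)
print(generate_document([1, 2, 3], beta, 8, rng))
```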
We will now learn how to fit the model on observed data. Then, we will talk about Moby Dick!
* Fitting the LDA¶
If $\beta_i$ and $\alpha$ were known, the posterior distribution of $\theta, z_i|w_i, \alpha, \beta_i$ could be approximated using MCMC methods. The empirical Bayes approach plugs in estimates of $\beta_i$ and $\alpha$ to do so:
Note that the joint distribution for a single word is given by $$P(\theta, z_i, w_i|\alpha_i, \beta_i) = P(w_i| z_i, \beta_i)P(z_i|\theta)P(\theta|\alpha), $$ and for a single document (let $Z = (z_1, \dots, z_N)$) thus by $$P(\theta, Z,W|\alpha, \beta) = \prod_{i=1}^N P(w_i| z_i, \beta_i)P(z_i|\theta)P(\theta|\alpha) = P(\theta|\alpha) \prod_{i=1}^N P(w_i| z_i, \beta_i)P(z_i|\theta).$$
Integrating over $\theta$ and summing over $Z$ gives the marginal distribution over this document: $$ P(W|\alpha, \beta) = \int P(\theta|\alpha) \prod_{i=1}^N \sum_{z_i}P(w_i| z_i, \beta_i)P(z_i|\theta) d\theta $$
The marginal distribution for the corpus $C$ is thus given as $$ P(C|\alpha, \beta) = \prod_{j=1}^M \int P(\theta_j|\alpha) \prod_{i=1}^{N_j} \sum_{z_{ij}}P(w_{ij}| z_{ij}, \beta_i)P(z_{ij}|\theta_j)\, d\theta_j $$
Natural estimators for $\alpha$ and $\beta$ are found by solving $$ \max_{\alpha, \beta} P(C|\alpha, \beta). $$
While this behemoth of a function cannot be maximized in closed form, numerical methods can be applied to obtain empirical Bayes estimates for $\alpha$ and $\beta$.
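To make the marginal likelihood $P(W|\alpha,\beta)$ concrete, here is a naive Monte Carlo sketch for the toy housekeeping example: draw many $\theta$ from the prior and average the resulting document likelihoods. The toy values are taken from the example above; this is purely for intuition, not how practical fitting works.

```python
import numpy as np

alpha = np.array([1.0, 2.0, 3.0])
beta = np.array([[0.1, 0.02, 0.88, 0.0],
                 [0.01, 0.79, 0.1, 0.1],
                 [0.75, 0.15, 0.1, 0.0]])
doc = [0, 1, 1, 2]  # word indices, e.g. "pan plot plot window"

def marginal_likelihood(doc, alpha, beta, n_draws=50_000, seed=1):
    rng = np.random.default_rng(seed)
    thetas = rng.dirichlet(alpha, size=n_draws)  # theta ~ Dir(alpha)
    # P(w_i|theta) = sum_z P(w_i|z, beta) P(z|theta) = theta . beta[:, w_i]
    word_probs = thetas @ beta[:, doc]           # shape (n_draws, len(doc))
    return word_probs.prod(axis=1).mean()        # average over theta draws

print(marginal_likelihood(doc, alpha, beta))
```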
Moby Dick 🐋¶
We will now consider each chapter of the novel as an individual document, so that the novel constitutes the entire corpus. The text itself is freely available online and part of the nltk.corpus.gutenberg submodule. Before we apply the LDA, some preprocessing is in order.
The following chunk retrieves the text, uses regex to split the novel into chapters, and lowercases the text for standardization.
import nltk
import re
moby = nltk.corpus.gutenberg.raw("melville-moby_dick.txt")
pattern = r"\s*(?:EXTRACTS|ETYMOLOGY\.|CHAPTER \d+|Epilogue)\s+.+\n*.+[\.!\?\)]\s*"
corpus = re.split(pattern, moby)
corpus.pop(0)
corpus = [re.sub(r"\s+", " ", document).lower() for document in corpus]
The regular expression splits the entire text into chapters. For our purposes, I treat the two introductory sections 'Extracts' and 'Etymology' as well as the 'Epilogue' as chapters, too. The regular expression matches the chapter marker (CHAPTER n, EXTRACTS, etc.) and the following title up to a literal period, exclamation mark, question mark, or closing parenthesis.
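To see the splitting idea in isolation, here is a toy version of the pattern applied to a miniature "novel" (simplified pattern and made-up toy text):

```python
import re

# A miniature "novel" with two chapters, each with a heading and body text.
toy = ("CHAPTER 1 Loomings.\nCall me Ishmael.\n"
       "CHAPTER 2 The Carpet-Bag.\nI stuffed a shirt or two.\n")

# Match the chapter marker plus its title up to the terminating period;
# re.split drops these separators and returns the chapter bodies.
parts = re.split(r"CHAPTER \d+ .+?[\.!\?]\s*", toy)
parts.pop(0)  # drop the (empty) text before the first chapter heading
print(parts)
```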
print(len(corpus)) # there are 135 chapters and three additional sections that we consider as documents, too.
138
print(corpus[2][:300])
call me ishmael. some years ago--never mind how long precisely--having little or no money in my purse, and nothing particular to interest me on shore, i thought i would sail about a little and see the watery part of the world. it is a way i have of driving off the spleen and regulating the circulati
The text still contains many stopwords, which convey little meaning. Also, I want to simplify the text further and stem the words, so that related words are represented by the same token.
Note: The natural language processing choices are very much subjective. You can try out different processing and encodings!
def myTokenizer(document): return re.findall(r"\w+", document)
corpus_tokenized = [myTokenizer(document) for document in corpus] # Tokenize
stopwords = nltk.corpus.stopwords.words("english") # Remove stopwords ...
stopwords.extend([',', '.', ':', '!', ';', '?', '--', '\'', '\'\'']) # ... I added some more stopwords
corpus_tokenized = [[word for word in document if word not in stopwords] for document in corpus_tokenized]
corpus_tokenized = [[nltk.PorterStemmer().stem(word) for word in document] for document in corpus_tokenized] # Stem
corpus_processed = [' '.join(document) for document in corpus_tokenized]
print(corpus_processed[2][:400])
call ishmael year ago never mind long precis littl money purs noth particular interest shore thought would sail littl see wateri part world way drive spleen regul circul whenev find grow grim mouth whenev damp drizzli novemb soul whenev find involuntarili paus coffin warehous bring rear everi funer meet especi whenev hypo get upper hand requir strong moral principl prevent deliber step street meth
This is not very readable, but it is ready to pass on to the LDA! It is always a good idea to check how large the data is before going forward.
words = [w for d in corpus_tokenized for w in d]
print(len(words)) # Total words
print(len(set(words))) # Unique words (Important: This is the dimension of object passed to the LDA!)
110166
10580
Now, we will use the LDA. We first have to reshape our corpus into a document-term matrix. CountVectorizer returns a sparse matrix, which we convert into a dense numpy array before passing it to sklearn.decomposition.LatentDirichletAllocation.
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(tokenizer = myTokenizer)
freq = vec.fit_transform(corpus_processed)
corpus_freq = freq.todense() # Use absolute frequency encoding, you can try out Tfidf or One-hot as alternatives.
corpus_freq = np.array(corpus_freq) # sklearn.decomposition.LatentDirichletAllocation expects an array as input
print(corpus_freq.shape) # (chapters, dictionary)
print(vec.get_feature_names_out()) # These are the words in our dictionary
(138, 10580)
['000' '1' '10' ... 'zone' 'zoolog' 'zoroast']
/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/sklearn/feature_extraction/text.py:517: UserWarning: The parameter 'token_pattern' will not be used since 'tokenizer' is not None'
from sklearn.decomposition import LatentDirichletAllocation
ntopics = 2 # I expect two topics - feel free to change and see the results
lda = LatentDirichletAllocation(n_components = ntopics,
random_state = 2025) # Set a seed for reproducibility
lda.fit(corpus_freq) # Fit the model
LatentDirichletAllocation(n_components=2, random_state=2025)
From the fitted model, we can retrieve the fitted posterior probability for each chapter, namely $P(\theta_j|C)$, $j \in\{1, \dots, M\}$, via
posterior = lda.transform(corpus_freq) # Retrieve the posterior from the fitted LDA
Working with a Fitted LDA Model¶
Let us inspect the fitted model and see whether the hypothesized dichotomy between cetological and storyline chapters is reflected in the estimated topic distributions.
import pandas as pd
df = pd.DataFrame(posterior).reset_index()
df.columns = ['Chapter'] + ['Topic ' + str(i + 1) for i in range(0,ntopics)]
# Custom numbering, so that the document numbers align with the chapter numbering
df['Chapter'] = df['Chapter'] - 1
df = df.set_index('Chapter')
# Add title as well
chapter_name = re.findall(r"\s*(?:EXTRACTS|ETYMOLOGY\.|CHAPTER \d+|Epilogue)\s+.+\n*.+[\.!\?\)]\s*", moby)
chapter_name = [re.sub(r"\s+", " ", document) for document in chapter_name]
df['Title'] = chapter_name
df[:4]
| Chapter | Topic 1 | Topic 2 | Title |
|---|---|---|---|
| -1 | 0.993416 | 0.006584 | ETYMOLOGY. (Supplied by a Late Consumptive Us... |
| 0 | 0.997943 | 0.002057 | EXTRACTS (Supplied by a Sub-Sub-Librarian). |
| 1 | 0.643041 | 0.356959 | CHAPTER 1 Loomings. |
| 2 | 0.229027 | 0.770973 | CHAPTER 2 The Carpet-Bag. |
The first two documents, Etymology and Extracts, in which Melville compiles some elementary facts about whales, are dominated by Topic 1. The storyline starts with Chapter 1, Loomings, and in the subsequent chapters Topic 2 surges. It appears justified to think of Topic 1 as the cetology (or rather non-story) topic, and Topic 2 as the storyline topic.
Note: I have set a seed to ensure reproducibility. If you change the seed or include more topics, your results will vary.
Let's investigate how the topic distribution changes over the chapters. I want to know how well the LDA distinguishes between cetology and storyline chapters.
For starters, we expect no cetological chapters towards the end of the novel, where the action peaks with the final chase of the whale. For the remaining chapters, I found this resource, which on page 28 identifies some chapters providing factual information.
Note: This is purely to gauge how we are doing. The LDA is an unsupervised method and no labels are required to estimate the topics.
cetological = [0, 1, 46, 54, 59, 60, 61, 63, 64, 68, 75,
76, 77, 78, 80, 81, 85, 86, 87, 89, 90, 91,
93, 96, 97, 98, 102, 103, 104, 105, 106]
story = [i not in cetological for i in range(138)]
labels = ['story' if s else 'ceto' for s in story]
import plotly.express as px
df_melted = pd.melt(df.drop(columns="Title").reset_index(), id_vars='Chapter')
fig = px.line(df_melted, x="Chapter", y="value", color='variable', labels={
"Chapter": "Chapter",
"value": "Probability",
"variable": "LDA Topics"
}, title = 'LDA Probabilities: Labelled cetology chapters are shaded')
for i, e in enumerate(labels):
if e=='ceto':
fig.add_vrect(x0=i-0.5, x1=i+0.5, line_width=0, fillcolor="grey", opacity=0.2)
fig.show()
Indeed, Topic 2 (storyline) overwhelmingly dominates the later chapters of the novel. The shaded chapters, which correspond to labeled cetology chapters, tend to favor Topic 1 more than others. This suggests that Topic 1 represents the non-storyline topic, as expected. Interestingly, some non-shaded chapters are also dominated by Topic 1, indicating that non-narrative content occasionally appears outside the designated cetology sections.
Among those are, for example, the chapters selected below.
df.loc[[24, 25, 32, 33, 41, 42]]
| Chapter | Topic 1 | Topic 2 | Title |
|---|---|---|---|
| 24 | 0.998623 | 0.001377 | CHAPTER 24 The Advocate. |
| 25 | 0.995213 | 0.004787 | CHAPTER 25 Postscript. |
| 32 | 0.999736 | 0.000264 | CHAPTER 32 Cetology. |
| 33 | 0.955865 | 0.044135 | CHAPTER 33 The Specksynder. |
| 41 | 0.972544 | 0.027456 | CHAPTER 41 Moby Dick. |
| 42 | 0.998942 | 0.001058 | CHAPTER 42 The Whiteness of The Whale. |
You can verify on your own here whether you agree with the LDA. Among the highly non-storyline chapters is the infamous Chapter 32: Cetology, which we haven't labelled before. But also Chapter 41: Moby Dick is found to be dominated by Topic 1.
Even though this chapter features the tokens Ishmael, Ahab and Moby Dick, the narrator merely talks about ravaging whales and provides the background to Ahab's obsession. A more naive classification (based, e.g., on word frequencies) might have misclassified this chapter. Due to the lack of action, however, the LDA identifies this chapter as a cetology (or better: non-action) chapter.
A natural question at this point is which words (tokens) are most characteristic of each topic. We can normalize the fitted components (the estimated $\beta$) across topics to obtain, for each word, the probability of each topic:
wordTopics = pd.DataFrame(lda.components_.T, index = vec.get_feature_names_out())
wordTopics = wordTopics.apply(lambda x: x / sum(x), 1) # Normalize each word's weights across topics
wordTopics.columns = ['Topic ' + str(i + 1) for i in range(0,ntopics)]
print(pd.concat([wordTopics[:3],
wordTopics[-3:]], axis=0))
         Topic 1   Topic 2
000      0.976124  0.023876
1        0.749995  0.250005
10       0.874345  0.125655
zone     0.456187  0.543813
zoolog   0.833330  0.166670
zoroast  0.250046  0.749954
We can sort by topic and retrieve the most likely words per topic.
print(wordTopics['Topic 1'].sort_values(ascending = False).head(10))
porpois       0.980020
herd          0.976430
magnitud      0.976403
000           0.976124
folio         0.973655
speci         0.970560
gabriel       0.970251
ii            0.968749
naturalist    0.968460
squeez        0.965937
Name: Topic 1, dtype: float64
The most likely words for Topic 1 (cetology) contain stems like folio and speci, which presumably appear almost exclusively in cetological contexts.
print(wordTopics['Topic 2'].sort_values(ascending = False).head(10))
steelkilt    0.987804
shipmat      0.987166
ha           0.985972
flask        0.985249
aye          0.985058
ho           0.983798
bulwark      0.982898
parse        0.982594
kick         0.982138
doubloon     0.981167
Name: Topic 2, dtype: float64
On the other hand, the storyline (or better: action) Topic 2 includes tokens that relate to the action aboard the Pequod, like aye, ho and ha!
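Note that the normalization above was across topics for each word. Normalizing `lda.components_` the other way, within each topic, gives the estimated $\beta$, i.e. $P(\text{word}|\text{topic})$, from which the top words per topic can be read off. A self-contained sketch on a made-up toy count matrix:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(20, 8))   # 20 toy "documents", 8-word "vocabulary"

lda_toy = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Each row of components_ holds pseudo-counts for one topic;
# normalizing within the row estimates beta = P(word | topic).
beta_hat = lda_toy.components_ / lda_toy.components_.sum(axis=1, keepdims=True)

# Indices of the 3 most likely words per topic.
top = np.argsort(beta_hat, axis=1)[:, ::-1][:, :3]
print(beta_hat.sum(axis=1))  # each row sums to 1
print(top)
```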
Conclusion¶
The application of LDA to Moby Dick demonstrates how unsupervised topic modeling can uncover meaningful structure in a (very) complex text. With only basic text processing and no labeling, we successfully identified a clear separation between action-driven narrative chapters (storyline) and the more expository cetology chapters. This supports our initial hypothesis and aligns with the experience of many readers.
More broadly, this example illustrates the power of LDA in exploratory text analysis. Whether applied to literature, news articles, or customer reviews, topic models can reveal latent themes, organize large collections of text, and support further analysis or classification. In our case, we even observed a shift in topic behavior as the novel moved toward its climax—a pattern that echoes how news topics might evolve over time. While interpretation always requires human judgment, LDA offers a principled and scalable way to uncover structure in unstructured data.